The eICU Collaborative Research Database is a multi-center database comprising deidentified health data associated with over 200,000 admissions to ICUs across the United States between 2014 and 2015. The database includes vital sign measurements, care plan documentation, severity of illness measures, diagnosis information, and treatment information. Data is collected through the Philips eICU program, a critical care telehealth program that delivers information to caregivers at the bedside.
This project is a comprehensive study on the eICU dataset, focusing on data extraction, cleaning, and exploratory data analysis. The final goal is to identify and implement a feature selection method that can be applied to the dataset for further ML modeling.
The eICU Collaborative Research Database (eICU-CRD) is a freely available dataset provided by the MIT Laboratory for Computational Physiology. The dataset contains high-granularity data from ICU patients across multiple hospitals.
The first step in this project was to identify and extract the relevant datasets from the eICU database. To determine which datasets to extract, I referred to the following resource:
Based on the findings in this study, I selected datasets that were most relevant to my exploratory analysis and future ML tasks.
After extracting the datasets, I performed the following data cleaning steps:
Handling Missing Values:
Data Normalization:
Outlier Detection:
Categorical Variable Encoding:
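The four steps above can be sketched on a toy frame. This is a minimal illustration, not the exact procedure used in this project: the toy values, the median/mode fill strategy, the 1.5 × IQR rule, z-score scaling, and one-hot encoding are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the extracted data (hypothetical values).
toy = pd.DataFrame({
    "SODIUM": [138.0, np.nan, 141.0, 136.0, 300.0],
    "GENDER": ["Male", "Female", "Male", None, "Male"],
})

# 1. Missing values: median for the numeric column, mode for the categorical one.
toy["SODIUM"] = toy["SODIUM"].fillna(toy["SODIUM"].median())
toy["GENDER"] = toy["GENDER"].fillna(toy["GENDER"].mode()[0])

# 2. Outlier detection: flag values outside 1.5 * IQR of the quartiles.
q1, q3 = toy["SODIUM"].quantile([0.25, 0.75])
iqr = q3 - q1
toy["SODIUM_OUTLIER"] = ~toy["SODIUM"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# 3. Normalization: z-score (zero mean, unit sample standard deviation).
toy["SODIUM_Z"] = (toy["SODIUM"] - toy["SODIUM"].mean()) / toy["SODIUM"].std()

# 4. Categorical encoding: one-hot columns for GENDER.
encoded = pd.get_dummies(toy, columns=["GENDER"])
print(encoded)
```

In practice the order of steps matters: scaling before outliers are handled lets extreme values dominate the mean and standard deviation.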
An exploratory analysis was conducted to better understand the structure, relationships, and distributions within the dataset. This included:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from scipy.stats import entropy
from scipy.stats import chi2
import plotly.express as px
raw_df = pd.read_csv("eICU_24hours_revised.csv")
raw_df.head(30)
| | PATIENTUNITSTAYID | UNIQUEPID | AGE | GENDER | ETHNICITY | UNITDISCHARGESTATUS | LAB_8HOURS | ALBUMIN | BUN | TOTALBILIRUBIN | ... | SODIUM | WBCCOUNT | VITAL_HOURS | VITAL_HEARTRATE | VITAL_RESPIRATION | VITAL_SAO2 | VITAL_TEMPERATURE | VITAL_SYSTEMIC_SYSTOLIC | VITAL_SYSTEMIC_DIASTOLIC | VITAL_SYSTEMIC_MEAN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 19.0 | 81.0 | 21.5 | 92.5 | NaN | 85.0 | 37.5 | 53.5 |
| 1 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 6.0 | 115.5 | 23.0 | 88.0 | NaN | 104.5 | 51.5 | 71.5 |
| 2 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 0.0 | 71.0 | 14.0 | 93.0 | NaN | 96.0 | 46.0 | 65.0 |
| 3 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 11.0 | 97.5 | 19.5 | 90.0 | NaN | 74.5 | 39.5 | 51.0 |
| 4 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 20.0 | 75.5 | 21.0 | 96.0 | NaN | 79.5 | 36.5 | 51.5 |
| 5 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 13.0 | 92.0 | 18.0 | 93.0 | NaN | 71.5 | 33.0 | 46.5 |
| 6 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 9.0 | 98.0 | 19.0 | 92.0 | NaN | 70.0 | 31.0 | 44.0 |
| 7 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 18.0 | 73.0 | 17.0 | 92.0 | NaN | 73.0 | 37.0 | 50.0 |
| 8 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 7.0 | 128.0 | 27.5 | 88.5 | NaN | 105.5 | 51.0 | 72.5 |
| 9 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 1.0 | 74.0 | 13.0 | 93.5 | NaN | 100.0 | 46.0 | 66.0 |
| 10 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 23.0 | 71.5 | 16.0 | 97.0 | NaN | 79.0 | 35.0 | 51.0 |
| 11 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 15.0 | 88.5 | 18.5 | 92.0 | NaN | 67.5 | 35.0 | 47.0 |
| 12 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 16.0 | 84.5 | 17.0 | 93.0 | NaN | 70.0 | 35.0 | 48.0 |
| 13 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 0.0 | 2.3 | 30.1 | 23.0 | ... | 136.0 | 8.9 | 6.0 | 115.5 | 23.0 | 88.0 | NaN | 104.5 | 51.5 | 71.5 |
| 14 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 17.0 | 81.5 | 20.0 | 93.0 | NaN | 74.0 | 37.0 | 51.0 |
| 15 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 2.0 | 77.0 | 15.5 | 93.0 | NaN | 93.5 | 40.5 | 59.5 |
| 16 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 3.0 | 83.0 | 14.0 | 94.5 | NaN | 96.0 | 42.5 | 63.0 |
| 17 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 5.0 | 102.0 | 21.0 | 91.0 | NaN | 97.0 | 49.5 | 69.0 |
| 18 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 22.0 | 76.5 | 23.0 | 95.0 | NaN | 79.0 | 36.0 | 51.5 |
| 19 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 4.0 | 91.5 | 17.5 | 93.0 | NaN | 98.0 | 49.0 | 68.0 |
| 20 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 10.0 | 98.0 | 20.5 | 91.0 | NaN | 62.5 | 26.5 | 38.5 |
| 21 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 21.0 | 80.0 | 22.0 | 94.0 | NaN | 72.0 | 34.0 | 48.0 |
| 22 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 14.0 | 94.5 | 23.5 | 93.0 | NaN | 74.0 | 38.0 | 52.0 |
| 23 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 8.0 | 102.0 | 21.0 | 93.5 | NaN | 77.5 | 35.5 | 50.0 |
| 24 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 1.0 | NaN | 25.0 | 23.0 | ... | 138.0 | 23.2 | 12.0 | 93.5 | 19.0 | 91.0 | NaN | 55.5 | 27.5 | 37.5 |
| 25 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 0.0 | 2.3 | 30.1 | 23.0 | ... | 136.0 | 8.9 | 8.0 | 102.0 | 21.0 | 93.5 | NaN | 77.5 | 35.5 | 50.0 |
| 26 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 0.0 | 2.3 | 30.1 | 23.0 | ... | 136.0 | 8.9 | 12.0 | 93.5 | 19.0 | 91.0 | NaN | 55.5 | 27.5 | 37.5 |
| 27 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 0.0 | 2.3 | 30.1 | 23.0 | ... | 136.0 | 8.9 | 0.0 | 71.0 | 14.0 | 93.0 | NaN | 96.0 | 46.0 | 65.0 |
| 28 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 0.0 | 2.3 | 30.1 | 23.0 | ... | 136.0 | 8.9 | 11.0 | 97.5 | 19.5 | 90.0 | NaN | 74.5 | 39.5 | 51.0 |
| 29 | 2257349 | 021-249916 | > 89 | Male | NaN | Alive | 0.0 | 2.3 | 30.1 | 23.0 | ... | 136.0 | 8.9 | 20.0 | 75.5 | 21.0 | 96.0 | NaN | 79.5 | 36.5 | 51.5 |
30 rows × 30 columns
raw_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2002029 entries, 0 to 2002028
Data columns (total 30 columns):
 #   Column                    Dtype
---  ------                    -----
 0   PATIENTUNITSTAYID         int64
 1   UNIQUEPID                 object
 2   AGE                       object
 3   GENDER                    object
 4   ETHNICITY                 object
 5   UNITDISCHARGESTATUS       object
 6   LAB_8HOURS                float64
 7   ALBUMIN                   float64
 8   BUN                       float64
 9   TOTALBILIRUBIN            float64
 10  LACTATE                   float64
 11  BICARBONATE               float64
 12  CHLORIDE                  float64
 13  CREATININE                float64
 14  GLUCOSE                   float64
 15  HEMOGLOBIN                float64
 16  HEMATOCRIT                float64
 17  PLATELETCOUNT             float64
 18  POTASSIUM                 float64
 19  PTT                       float64
 20  SODIUM                    float64
 21  WBCCOUNT                  float64
 22  VITAL_HOURS               float64
 23  VITAL_HEARTRATE           float64
 24  VITAL_RESPIRATION         float64
 25  VITAL_SAO2                float64
 26  VITAL_TEMPERATURE         float64
 27  VITAL_SYSTEMIC_SYSTOLIC   float64
 28  VITAL_SYSTEMIC_DIASTOLIC  float64
 29  VITAL_SYSTEMIC_MEAN       float64
dtypes: float64(24), int64(1), object(5)
memory usage: 458.2+ MB
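The `info()` output lists dtypes but no non-null counts, so it does not show how much of each column is missing. A small helper can fill that gap. The sketch below is hypothetical: `missing_report` and the `demo` frame are names and values made up for illustration, and the same call could be applied to `raw_df`.

```python
import numpy as np
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most-missing first."""
    report = pd.DataFrame({
        "n_missing": df.isna().sum(),
        "frac_missing": df.isna().mean(),
    })
    return report.sort_values("frac_missing", ascending=False)

# Small stand-in frame (hypothetical values) instead of the 2M-row raw_df.
demo = pd.DataFrame({
    "VITAL_TEMPERATURE": [np.nan, np.nan, 37.2, np.nan],
    "SODIUM": [138.0, 136.0, np.nan, 140.0],
    "VITAL_HOURS": [0.0, 1.0, 2.0, 3.0],
})
report = missing_report(demo)
print(report)
```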
raw_df[raw_df['UNITDISCHARGESTATUS'] =='Alive'].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PATIENTUNITSTAYID | 1836320.0 | 1.821864e+06 | 977502.824033 | 141233.00 | 1063650.000 | 1716722.00 | 2773016.00 | 3353190.00 |
| LAB_8HOURS | 1832700.0 | 8.098429e-01 | 0.801971 | 0.00 | 0.000 | 1.00 | 1.00 | 2.00 |
| ALBUMIN | 836932.0 | 3.024546e+00 | 0.714837 | 0.40 | 2.500 | 3.00 | 3.50 | 5.70 |
| BUN | 1407696.0 | 2.382507e+01 | 5.166149 | 3.00 | 21.000 | 24.00 | 27.00 | 63.00 |
| TOTALBILIRUBIN | 1482024.0 | 2.844705e+01 | 23.366748 | 1.00 | 14.000 | 21.00 | 35.00 | 292.50 |
| LACTATE | 1489852.0 | 1.040999e+02 | 7.264179 | 58.00 | 100.000 | 104.00 | 108.00 | 151.00 |
| BICARBONATE | 1487592.0 | 1.631430e+00 | 1.739738 | 0.10 | 0.790 | 1.08 | 1.71 | 40.38 |
| CHLORIDE | 1490335.0 | 1.493133e+02 | 74.021693 | 14.00 | 108.000 | 131.00 | 166.00 | 1964.00 |
| CREATININE | 1491926.0 | 3.330030e+01 | 7.124183 | 7.40 | 28.000 | 33.00 | 38.30 | 66.10 |
| GLUCOSE | 1497979.0 | 1.101072e+01 | 2.411283 | 2.50 | 9.200 | 10.90 | 12.70 | 22.80 |
| HEMOGLOBIN | 572958.0 | 2.432242e+00 | 2.053874 | 0.00 | 1.200 | 1.80 | 2.90 | 26.80 |
| HEMATOCRIT | 1400847.0 | 2.046675e+02 | 100.541769 | 0.00 | 138.000 | 190.50 | 253.00 | 2032.50 |
| PLATELETCOUNT | 1590415.0 | 4.110771e+00 | 0.665436 | 1.00 | 3.700 | 4.05 | 4.50 | 8.80 |
| POTASSIUM | 606029.0 | 3.857429e+01 | 20.817973 | 1.13 | 27.700 | 32.00 | 40.00 | 267.00 |
| PTT | 777721.0 | 1.119279e+00 | 2.107224 | 0.00 | 0.400 | 0.60 | 1.10 | 51.20 |
| SODIUM | 1526870.0 | 1.381931e+02 | 5.907452 | 70.19 | 135.500 | 138.50 | 141.00 | 181.50 |
| WBCCOUNT | 1397388.0 | 1.263588e+01 | 8.543711 | 0.00 | 8.100 | 11.10 | 15.22 | 448.35 |
| VITAL_HOURS | 1835977.0 | 1.159628e+01 | 6.883319 | 0.00 | 6.000 | 12.00 | 18.00 | 23.00 |
| VITAL_HEARTRATE | 1834939.0 | 8.718529e+01 | 19.071587 | 0.00 | 73.500 | 86.00 | 99.50 | 197.00 |
| VITAL_RESPIRATION | 1664359.0 | 1.960662e+01 | 6.402277 | 0.00 | 16.000 | 19.00 | 23.00 | 193.00 |
| VITAL_SAO2 | 1778579.0 | 9.714715e+01 | 3.108405 | 0.00 | 96.000 | 98.00 | 100.00 | 100.00 |
| VITAL_TEMPERATURE | 233682.0 | 3.831585e+01 | 8.968619 | 1.40 | 36.667 | 37.20 | 37.80 | 106.05 |
| VITAL_SYSTEMIC_SYSTOLIC | 505888.0 | 1.186201e+02 | 22.815711 | -50.00 | 103.000 | 116.00 | 132.00 | 392.50 |
| VITAL_SYSTEMIC_DIASTOLIC | 505880.0 | 5.851252e+01 | 13.998958 | -50.00 | 50.000 | 57.00 | 65.00 | 392.00 |
| VITAL_SYSTEMIC_MEAN | 507903.0 | 7.831462e+01 | 18.534002 | -50.00 | 68.000 | 76.00 | 85.50 | 392.50 |
raw_df[raw_df['UNITDISCHARGESTATUS'] =='Expired'].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PATIENTUNITSTAYID | 165637.0 | 1.875447e+06 | 975251.133724 | 143056.000 | 1057733.00 | 1707267.00 | 2866028.000 | 3352827.00 |
| LAB_8HOURS | 165515.0 | 8.711355e-01 | 0.813360 | 0.000 | 0.00 | 1.00 | 2.000 | 2.00 |
| ALBUMIN | 82552.0 | 2.727028e+00 | 0.711621 | 0.200 | 2.20 | 2.70 | 3.200 | 5.50 |
| BUN | 131581.0 | 2.256487e+01 | 5.616398 | 4.000 | 19.00 | 22.00 | 26.000 | 51.00 |
| TOTALBILIRUBIN | 137627.0 | 3.462372e+01 | 24.672793 | 2.000 | 18.00 | 28.00 | 44.000 | 233.00 |
| LACTATE | 138147.0 | 1.043977e+02 | 7.731647 | 72.000 | 99.50 | 104.50 | 109.000 | 146.00 |
| BICARBONATE | 137492.0 | 1.889673e+00 | 1.644995 | 0.100 | 0.90 | 1.38 | 2.250 | 25.65 |
| CHLORIDE | 138639.0 | 1.616686e+02 | 81.518815 | 11.000 | 112.00 | 140.00 | 187.000 | 1165.00 |
| CREATININE | 131162.0 | 3.304997e+01 | 7.207123 | 12.900 | 27.65 | 32.40 | 38.000 | 68.00 |
| GLUCOSE | 132111.0 | 1.084988e+01 | 2.396651 | 4.300 | 9.00 | 10.60 | 12.400 | 23.30 |
| HEMOGLOBIN | 82299.0 | 3.739036e+00 | 3.385635 | 0.300 | 1.60 | 2.60 | 4.700 | 34.20 |
| HEMATOCRIT | 125240.0 | 1.927548e+02 | 104.257898 | 3.000 | 120.00 | 182.00 | 248.000 | 924.00 |
| PLATELETCOUNT | 146031.0 | 4.123020e+00 | 0.722133 | 1.800 | 3.65 | 4.05 | 4.550 | 8.20 |
| POTASSIUM | 64003.0 | 4.233725e+01 | 22.858693 | 15.000 | 29.20 | 34.90 | 46.100 | 217.00 |
| PTT | 78885.0 | 1.806129e+00 | 3.830677 | 0.100 | 0.50 | 0.80 | 1.400 | 60.20 |
| SODIUM | 142920.0 | 1.387723e+02 | 6.304121 | 111.000 | 135.00 | 139.00 | 142.000 | 180.50 |
| WBCCOUNT | 124677.0 | 1.429833e+01 | 9.548728 | 0.000 | 8.70 | 12.80 | 17.830 | 186.30 |
| VITAL_HOURS | 165618.0 | 1.159888e+01 | 6.889899 | 0.000 | 6.00 | 12.00 | 18.000 | 23.00 |
| VITAL_HEARTRATE | 165507.0 | 9.104706e+01 | 21.302532 | 0.000 | 75.50 | 90.00 | 105.500 | 191.00 |
| VITAL_RESPIRATION | 152329.0 | 2.106889e+01 | 7.380312 | 0.000 | 16.00 | 20.00 | 25.000 | 141.00 |
| VITAL_SAO2 | 160001.0 | 9.684447e+01 | 4.377797 | 0.000 | 95.00 | 98.00 | 100.000 | 100.00 |
| VITAL_TEMPERATURE | 31200.0 | 3.801164e+01 | 10.802520 | 17.556 | 34.70 | 36.80 | 37.722 | 102.80 |
| VITAL_SYSTEMIC_SYSTOLIC | 65215.0 | 1.153558e+02 | 23.227606 | -40.000 | 99.50 | 113.00 | 129.000 | 287.50 |
| VITAL_SYSTEMIC_DIASTOLIC | 65212.0 | 5.899197e+01 | 14.145284 | -41.000 | 50.00 | 57.00 | 66.500 | 287.50 |
| VITAL_SYSTEMIC_MEAN | 65656.0 | 7.747721e+01 | 18.746303 | -41.000 | 67.00 | 75.00 | 86.000 | 353.00 |
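Both summary tables contain values that cannot be real measurements, such as blood pressures of -50 and oxygen saturations of 0. One way to handle them is to mask out-of-range values as missing before any imputation. The sketch below is only an illustration: the limits in `PLAUSIBLE_RANGES` and the helper `mask_implausible` are assumptions made up for this example, not clinically validated thresholds and not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical plausibility limits; illustrative assumptions only.
PLAUSIBLE_RANGES = {
    "VITAL_SYSTEMIC_SYSTOLIC": (0, 300),
    "VITAL_SAO2": (50, 100),
}

def mask_implausible(df: pd.DataFrame, ranges: dict) -> pd.DataFrame:
    """Return a copy in which out-of-range values are replaced by NaN."""
    out = df.copy()
    for col, (low, high) in ranges.items():
        # between() is inclusive on both ends; where() keeps values that pass.
        out[col] = out[col].where(out[col].between(low, high))
    return out

# Stand-in rows using extreme values seen in the summary tables.
demo = pd.DataFrame({
    "VITAL_SYSTEMIC_SYSTOLIC": [-50.0, 116.0, 392.5],
    "VITAL_SAO2": [0.0, 98.0, 100.0],
})
cleaned = mask_implausible(demo, PLAUSIBLE_RANGES)
print(cleaned)
```

Masking rather than dropping rows keeps the other measurements recorded in the same row.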
raw_df.columns
Index(['PATIENTUNITSTAYID', 'UNIQUEPID', 'AGE', 'GENDER', 'ETHNICITY',
'UNITDISCHARGESTATUS', 'LAB_8HOURS', 'ALBUMIN', 'BUN', 'TOTALBILIRUBIN',
'LACTATE', 'BICARBONATE', 'CHLORIDE', 'CREATININE', 'GLUCOSE',
'HEMOGLOBIN', 'HEMATOCRIT', 'PLATELETCOUNT', 'POTASSIUM', 'PTT',
'SODIUM', 'WBCCOUNT', 'VITAL_HOURS', 'VITAL_HEARTRATE',
'VITAL_RESPIRATION', 'VITAL_SAO2', 'VITAL_TEMPERATURE',
'VITAL_SYSTEMIC_SYSTOLIC', 'VITAL_SYSTEMIC_DIASTOLIC',
'VITAL_SYSTEMIC_MEAN'],
dtype='object')
cat_cols = ['GENDER', 'ETHNICITY']
num_cols = ['LAB_8HOURS','AGE','ALBUMIN', 'BUN', 'TOTALBILIRUBIN',
'LACTATE', 'BICARBONATE', 'CHLORIDE', 'CREATININE','VITAL_HOURS',
'GLUCOSE', 'HEMOGLOBIN', 'HEMATOCRIT', 'PLATELETCOUNT', 'POTASSIUM',
'PTT', 'SODIUM', 'WBCCOUNT','VITAL_HEARTRATE',
'VITAL_RESPIRATION', 'VITAL_SAO2', 'VITAL_TEMPERATURE',
'VITAL_SYSTEMIC_SYSTOLIC', 'VITAL_SYSTEMIC_DIASTOLIC',
'VITAL_SYSTEMIC_MEAN']
unit_discharge_status_counts = raw_df['UNITDISCHARGESTATUS'].value_counts(dropna=True)
# Plot the bar chart
plt.figure(figsize=(10, 6))
unit_discharge_status_counts.plot(kind='bar')
plt.xlabel('Unit Discharge Status')
plt.ylabel('Count')
plt.title('Distribution of Unit Discharge Status')
plt.xticks(rotation=45)
plt.show()
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 10))
axes = axes.flatten()
# pie chart
for i, col in enumerate(cat_cols):
    # count of each category
    counts = raw_df[col].value_counts()
    axes[i].pie(counts, labels=counts.index,
                autopct='%1.1f%%', textprops={'fontsize': 8},
                startangle=5, colors=sns.color_palette('Set3'))
    axes[i].set_title(col)
plt.tight_layout()
plt.show()
raw_df['GENDER'].value_counts(dropna=False)
# Replace 'Unknown' and 'Other' values with the mode.
# Assign the result back instead of using inplace=True on a column selection,
# which triggers a chained-assignment FutureWarning and stops working in pandas 3.0.
raw_df['GENDER'] = raw_df['GENDER'].replace(['Unknown', 'Other'], raw_df['GENDER'].mode()[0])
raw_df['ETHNICITY'].value_counts(dropna=False)
raw_df['ETHNICITY'].value_counts(dropna=False) / len(raw_df['ETHNICITY'])
ETHNICITY
Caucasian           0.771173
African American    0.115601
Other/Unknown       0.046606
Hispanic            0.036148
Asian               0.015361
NaN                 0.009789
Native American     0.005323
Name: count, dtype: float64
raw_df['ETHNICITY'] = raw_df['ETHNICITY'].replace({
    'African American': 'American',
    'Native American': 'American',
    'Other/Unknown': 'Others',
    'Hispanic': 'Others',
    'Asian': 'Others',
    np.nan: 'Others'
})
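The category shares and the regrouping can be checked on a small example. This is a sketch on a made-up series: `value_counts(normalize=True, dropna=False)` is an equivalent shortcut for dividing the counts by the series length, and missing values are handled here with a separate `fillna` call rather than a `np.nan` key in the mapping.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for raw_df['ETHNICITY'].
ethnicity = pd.Series(
    ["Caucasian", "African American", "Asian", np.nan, "Caucasian"],
    name="ETHNICITY",
)

# Proportion of each category, keeping NaN as its own row.
shares = ethnicity.value_counts(normalize=True, dropna=False)
print(shares)

# Same grouping as in the cell above, applied to the stand-in series.
mapping = {
    "African American": "American",
    "Native American": "American",
    "Other/Unknown": "Others",
    "Hispanic": "Others",
    "Asian": "Others",
}
grouped = ethnicity.replace(mapping).fillna("Others")
print(grouped.value_counts())
```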
raw_df['AGE'].value_counts()
AGE
> 89 68791
67 52701
72 52057
71 49858
68 49831
...
17 1618
16 425
15 236
14 48
11 24
Name: count, Length: 78, dtype: int64
# Replace '>89' with 90
raw_df['AGE'] = raw_df['AGE'].str.replace('> 89', '90').astype(int)
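The conversion above raises an error if `AGE` contains missing or empty entries, because `astype(int)` cannot represent them. A more defensive variant is sketched below. It assumes that top-coding `> 89` as 90 is acceptable, and `parse_age` is a hypothetical helper name introduced for this example.

```python
import pandas as pd

def parse_age(age: pd.Series, top_code: int = 90) -> pd.Series:
    """Convert AGE strings to nullable integers, mapping '> 89' to top_code."""
    cleaned = age.str.strip().str.replace("> 89", str(top_code), regex=False)
    # Unparseable or missing entries become NaN, then <NA> in the Int64 dtype.
    return pd.to_numeric(cleaned, errors="coerce").astype("Int64")

# Made-up values covering the top-coded, missing, and empty cases.
ages = pd.Series(["> 89", "67", None, "", "15"])
parsed = parse_age(ages)
print(parsed)
```

The nullable `Int64` dtype keeps ages as integers while still allowing missing entries, so the imputation decision can be made later.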
# Plot histograms
plt.figure(figsize=(20, 20))
num_cols_re = num_cols.copy()
num_cols_re.remove('VITAL_HOURS')
num_cols_re.remove('LAB_8HOURS')
for i, col in enumerate(num_cols_re, 1):
    plt.subplot(6, 4, i)  # Adjust the number of rows and columns as needed
    raw_df[col].dropna().hist(bins=30)  # Drop NaN values for histogram
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('Frequency')
plt.tight_layout()
# Save the plot as a PNG file
plt.savefig('histograms_with_hue.png')
plt.show()
cols_to_fill_mean = ['ALBUMIN', 'BUN', 'LACTATE', 'CREATININE', 'PLATELETCOUNT', 'SODIUM',
'VITAL_SYSTEMIC_SYSTOLIC']
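The list name suggests that these columns are to be filled with their column means, although the fill step itself is not shown here. A minimal sketch of that step on a made-up frame follows. Using only two of the listed columns and computing the means on the full frame are simplifications for the example.

```python
import numpy as np
import pandas as pd

# Two columns from the list above, for brevity.
cols_to_fill_mean = ["ALBUMIN", "SODIUM"]

# Made-up values; PTT is not in the list and is left untouched.
demo = pd.DataFrame({
    "ALBUMIN": [2.0, np.nan, 4.0],
    "SODIUM": [138.0, 140.0, np.nan],
    "PTT": [np.nan, 30.0, 32.0],
})

# mean() skips NaN by default; fillna with a Series fills column by column.
column_means = demo[cols_to_fill_mean].mean()
filled = demo.copy()
filled[cols_to_fill_mean] = demo[cols_to_fill_mean].fillna(column_means)
print(filled)
```

For later ML modeling, the means would normally be computed on the training split only and reused on the test split, so that no information leaks from held-out data.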
# Create pair plots of the numeric columns, with histograms on the diagonal
sns.pairplot(raw_df[num_cols_re], diag_kind='hist')
plt.show()